Chromosome-level genome assembly of Chouioia cunea Yang, the parasitic wasp of the fall webworm

Chouioia cunea Yang 1989 is a parasitic wasp of many lepidopteran insects during their pupal stage, and has been successfully used to control pests such as the fall webworm Hyphantria cunea. Here we reported the chromosome-level genome of C. cunea by using short (MGI-SEQ), long (Oxford Nanopore), chromatin-linked (Hi-C) sequencing reads and transcriptomic data, representing the first chromosome-level genome of parasitic wasps of the family Eulophidae. The total assembly length is 171.99 Mb, containing 6 pesudo-chromosomes with a GC content of 36.89% and the scaffold/contig N50 length of 31.70/26.52 Mb. The BUSCO completeness of the assembly was estimated to be 98.7%. A total of 12,258 protein-coding genes (PCGs), 10,547 3′-UTRs, and 10,671 5′-UTRs were annotated. This high-quality genome is an important step toward a better understanding of the genomes of the Eulophidae (Chalcidoidea), and will serve as a valuable resource for analyses of phylogenetic relationships and the evolution of Hymenoptera.

www.nature.com/scientificdata www.nature.com/scientificdata/ chromosome-level genome assembly of the Eulophidae. The genome assembly size of C. cunea was 171.99 Mb with the scaffold/contig N50 length being 31.70/26.52 Mb ( Table 2). The assembly of Hi-C technology scaffolding reached chromosome-level with six pseudo-chromosomes containing only five gaps (Fig. 1, 2). 26.06 Mb repeat sequences were identified, accounting for 15.15% of the C. cunea genome. A total of 12,258 protein-coding genes were predicted. Additionally, phylogenetic analyses based on single-copy genes were constructed to understand the relationship between C. cunea and other species. As the first high-quality chromosome-level genome assembly for Eulophidae, this genome is a valuable resource for enhancing our understanding of the evolutionary relationship of chalcidoids and enabling comparative studies on the biology, behavior, and genetic evolution of C. cunea. www.nature.com/scientificdata www.nature.com/scientificdata/

Methods
Insect collection and sequencing. The species Chouioia cunea were originally obtained from the Keyun company in Henan Province, China. All insect colonies were maintained in an environment-controlled room (26 ± 1°C, 60 ± 10% RH and 16: 8 h L: D photoperiod). The voucher specimens were stored in 100% ethanol and kept in the Parasitic Hymenoptera Collection of the Institute of Insect Sciences, Zhejiang University. DNA samples were extracted from adult individuals using Blood & Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany). Long-reads and short-reads (library insert size: 350 bp) sequencing were performed on Nanopore PromethION platform (Oxford Nanopore Technologies, UK) and MGISEQ. 2000 (MGI, Shenzhen, China), respectively. In total, 48.88 Gb long reads (~284.2 coverage) and 25.37 Gb short reads (~147.5 coverage) were generated (Table 1). Short-reads transcriptomes of larvae and adults were sequenced on the same platforms as DNA samples. For full-length transcriptome, the Invitrogen Trizol kit was used to extract RNA. Purified mRNA isolated from adult individuals were fragmented with divalent cations under increased temperature. These short fragments were used as templates to synthesize the first-strand cDNA using hexamer primers and superscriptTMIII (InvitrogenTM, Carlsbad, CA, USA). Second-strand cDNA was then synthesized in a solution containing buffer, dNTP, RNaseH, and DNA polymerase I and subsequently purified using a QiaQuick PCR extraction kit (Qiagen). EB buffer was used to resolve these short fragments for end reparation and poly A addition. The sequence adaptors were linked to two ends of short cDNA sequences, and suitably sized cDNA fragments were selected out for PCR amplification based on the agarose gel electrophoresis results. Finally, the library established was sequenced using a PromethION platform. The total data generated from full-length sequencing was 13.06 Gb, while the total data generated from short-read sequencing was 20.99 Gb ( Table 1).
The Hi-C library was constructed using the restriction endonuclease DpnII. In brief, the adult individuals of C. cunea were fixed with formaldehyde and crosslinking was stopped by adding glycine. The fixed tissue was then grounded to powder before re-suspending in nuclei isolation buffer to obtain a suspension of nuclei. The purified nuclei were digested with DpnII and marked by incubating with biotin-14-dATP. Biotin-14-dATP from non-ligated DNA ends was removed owing to the exonuclease activity of T4 DNA polymerase. The ligated DNA was sheared into 300−600 bp fragments, and then was blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pull down. Finally, a total of 25.93 Gb Hi-C data (~150.8 coverage) was generated after sequencing on an MGISEQ. 2000 platform (Table 1).
De novo genome assembly. The raw reads were cleaned and filtered by fastp v0.19.4 7 with default parameters. The genome size of C. cunea was estimated using GenomeScope v1.0.0 8 and the 17-mer distribution was analyzed via Jellyfish v2.3.0 9 . Based on the total number of 5,232,854,730 17-mers and a peak 17-mer depth of 33, the genome size of C. cunea was estimated to be 158.57 Mb, and the estimated heterozygosity rate was approximately 0.80%.
We first downloaded the mitochondrial genome of Sclerodermus guani from NCBI (GenBank: MH748670.1) as a reference. The ONT reads were aligned to the mitochondrial genome reference using Minimap2 v2.1 10 . The aligned reads were self-corrected using Canu v2.1.1 11 . The corresponding mitochondrial reads were manually assembled into a circular mitochondrion and polished using Nextpolish v1.3.1 12 by short reads. To assemble the nuclear genome of C. cunea, NextDenovo v2.3.1 (https://github.com/Nextomics/ NextDenovo) was used for genome assembly with the default parameters. Because mitochondria can cause over-polishing to the nuclear mitochondrial (NUMT) region, it is important to include the mitochondrial genome before genome polishing 13,14 . The mitochondrial sequence was tandemly replicated twice and added as a single contig to the end of the assembly. We performed two rounds of long reads correction using Racon v1.4.28 15 and one round of short reads correction using Nextpolish. The mitochondrial contig was discarded after the polish was finished. Purge_dups v1.2.5 16 was used to remove haplotypic duplication. Next, 3D de novo assembly (3D-DNA) 17 was used to assemble each draft into a candidate chromosome-length assembly. Juicebox v1.22 18 viewed the HiC-map of each assembly. The completeness of the assemblies was evaluated by the BUSCO v4.1.4 19 (in the insecta_Odb10 database 20 ) to be 98.70%, including 1,323 single-copy genes and 26 duplicated genes (Table S2). For gap-filling and telomere discovery, the error-prone long reads were corrected first by Fmlrc2 21 with short reads and then self-corrected using Canu. The corrected ONT reads > 50 KB were selected. The non-repeat region of surrounding sequences of both sides of the gap was used as a query to search the raw draft assembly and the corrected long reads using blastn and MUMmer v3.5 22 . The ends of super scaffolds were queried to the corrected reads to find reads containing telomeric repeat. The Hi-C heatmap of our final assemblies was plotted and visualized by hicPlotMatrix v3.6 23 . The final version of the genome assembly of C. cunea was highly contiguous with a total assembly length of 171.99 Mb contained in Repeat annotation. The filtered library was combined by RepBase and Dfam v3.2 24 referring to Insecta. This data set was then used as a library to search the TEs on C. cunea using RepeatMasker v4.1.2 25 . Species-specific repeat library was generated using RepeatModeler v2.1.0 (http://www.repeatmasker.org/RepeatModeler/) with default settings and these identified repeat elements were classified using a reference-based similarity search against a species-specific repeat library in RepeatMasker. Next, the annotation results of repeat elements were merged against the filtered library and species-specific repeat library. To estimate sequencing depth, ONT long reads were aligned back to the assembly using minimap2. The depth of repeat and non-repeat regions was then calculated using bedtools. 26.06 Mb repeat sequences had been identified, accounting for 15.15% of the C. cunea genome, including DNA transposons (5.90%), unclassified (5.00%), retroelements (3.73%), LTR elements (2.60%), LINEs (1.12%), and simple repeats (0.16%). Rolling-circles and SINEs accounted for a small proportion of repetitive elements ( Table 3, Table S1). Upon investigating the sequencing depths, we found that the mean depth within the repeat regions was approximately 243.82x, and for the non-repeat regions, it was about 261.07x. The modestly lower depth within the repeat regions, as compared to the non-repeat regions, suggests that there is no obvious collapse in the assembly. These repeat-sequences on all scaffolds were masked before genome annotation.

Protein-coding genes (PCGs) prediction and other annotation of the genome. Protein-coding
genes prediction was performed using three strategies: transcriptome-based prediction, homology-based prediction and ab initio prediction 26 . For transcriptome-based prediction, RNA-sequencing data derived from three larvae and adults were assembled separately using Hisat2 v2.   www.nature.com/scientificdata www.nature.com/scientificdata/ manuals. Finally, all annotation results from the three strategies were integrated using EVidenceModeler (EVM) v1.1.1 34 . 12,258 protein-coding genes, 10,547 3′ -UTRs, and 10,671 5′ -UTRs were identified in this genome, the completeness of the annotation was evaluated by the BUSCO to be 98.70%, including 1,318 single-copy genes and 32 duplicated genes (Table S2). 11,277 genes (92.00%) encode proteins with at least one known domain in the InterPro database 35 .

Chromosome synteny.
To identify collinear gene blocks between C. cunea and N. vitripennis, all protein coding genes were screened on genome-scale orthologue alignment (by blastp) using MCScanX 39 , and the syntenic genes of the high-quality blocks were visualized in Tbtools v0.58 40 . Syntenic relationships presented that 13,809 genes were linked between C. cunea and N. vitripennis (Fig. 2b). This demonstrated that the genome of C. cunea is highly homologous with that of N. vitripennis. Strong collinearity between these two species revealed conserved chromosomal characteristics.  (Fig. 4). A total of 314 genes belonging to 54 families were identified as specific to C. cunea. Compared with other chalcidoids, C. cunea has fewer unique paralogs and annotated genes (Fig. 4). 2,986 single-copy genes were selected for phylogenetic tree construction. Protein sequences of the identified single-copy genes were aligned by MAFFT v7 42 , followed by Aliscore v02.2 and Alicut v2.31 43 to remove the sites with weak phylogenetic signals.
Contraction and expansion of gene families. OGs evolution (expansion and contraction) was calculated to birth and death rates using CAFE v5.0 48 with the lambda parameter based on the tree above. In C. cunea, 11,691 (95.40%) genes were clustered into 9,204 gene families, of which 328 gene families were expanded, 360 gene families were contracted, and 54 (37 expansions and 17 contractions) gene families experienced rapid evolution. These gene functions were illustrated by comprehensive databases, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). Functional annotation was performed using emapper v2.1.3 with hmmer v3.3.2 sequence searches strategy and assigning to eggNOG orthology data of Insecta [49][50][51] . The significantly expanded families were involved in various molecular functions and biological processes (Fig. 5, Table S5), including trypsin-like serine protease, odorant receptor, gustatory receptor, sugar transporter, cytochrome P450, ABC transporter and alpha/beta hydrolase encoding genes. The families with significant contractions mainly include the DDE superfamily endonuclease, putative peptidase, and hAT family C-terminal dimerisation region (Table S6). All gene enrichments of the KEGG pathway and GO terms were executed by R package clusterProfiler v4.1.1 (Fig. 5) 52 .

Data Records
All raw data of the whole genome have been deposited into the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJNA881787.
The genomic sequencing data were deposited in the SRA at NCBI SRR21622056 53 and SRR21655425 54 . The transcriptomic sequencing data were deposited in the SRA at NCBI SRR21620783, SRR21621373 and SRR21621151 55-57 .
The Hi-C sequencing data were deposited in the SRA at NCBI SRR21620083 58 . The final genome assembly was deposited in GenBank at NCBI JAOPFQ000000000 59 .
Results from the genome assembly and annotation are available in figshare 60 .

Technical Validation
Quality evaluation of genome assembly. We assessed the quality of C. cunea genome assembly in the three aspects. Firstly, the Core Eukaryotic Genes Mapping Approach (CEGMA) defined 458 core eukaryotic genes, and 248 of them were the most highly-conserved core genes, which could be used to assess the completeness of the genome or annotations. We identified that 245 (98.79%) core genes have homologous genes in the highly-conserved core gene sets. Secondly, the completeness of protein-coding gene predictions was evaluated by the BUSCO (in the insects_odb10 database) completeness to be 98.70% (96.4% single-copied genes and 2.3% duplicated genes), 0.1% fragmented, and 1.2% missing. Thirdly, sequencing data were mapped to the genome to verify the accuracy, yielding mapping rates of 99.90% for MGI, and 98.85% for ONT data.   Fig. 4 The maximum likelihood phylogenetic tree based on 2,986 concatenated single-copy orthologous genes from 13 hymenopterans. The bootstrap value of all nodes is supported at 100/100, and gene counts different types of orthologous groups. The expansion, contraction and rapid evolution numbers (red) of orthologous groups (OGs) are shown on the nodes and tips. "1:1:1" indicates universal single-copy genes present in all species; "N: N: N" indicates multicopy genes, although the absence in a single genome is tolerated; "Chalcidoidea" means common unique genes in species from Chalcidoidea. "Species-specific" represents species-specific genes in the genome; "Unassigned" indicates genes which cannot be assigned into any gene families (orthogroups); "Others" means the remaining genes.